Take Home Exercise 3

Putting Visual Analytics into Practical Use.

Mak Han Ren (https://www.linkedin.com/in/mak-han-ren/), School of Computing and Information Systems, SMU (https://scis.smu.edu.sg)
2022-02-19

1.0 The Task

We are tasked with creating a data visualisation that segments the kids' drinks and other category by nutrition indicators. For the purpose of this task, starbucks_drink.csv should be used.

1.1 Task Considerations

Since we are segmenting kids' drinks by nutritional indicators, we will focus on the nutritional values. The goal is a data visualisation that tells a story: are kids' drinks at Starbucks healthy, or should parents treat their kids to plain water as the healthier choice?

Upon observing the variables, we can note down 13 different nutritional indicators - Portion (fl oz), Calories, Calories from fat, Total Fat(g), Saturated fat(g), Trans fat(g), Cholesterol(mg), Sodium(mg), Total Carbohydrate(g), Dietary Fiber(g), Sugars(g), Protein(g), Caffeine(mg).

However, upon further inspection we notice that Portion (fl oz) may not be an effective nutritional indicator, since it simply reflects the drink size ordered. As such, we will first determine whether portion is correlated with the other nutritional indicators before moving on to building our data visualisation.

The last step is choosing the best kind of chart to tell our data story on whether kids should avoid Starbucks drinks due to their lack of nutritional value. Of the data visualisations we were exposed to in lesson 4, the choice is between a parallel coordinates plot and a heat map. In this case, a heat map is the better choice because we can combine it with hierarchical clustering to group drinks with similar nutritional profiles.

As such, there are two parts to our task: (1) building a correlogram to determine the correlation between Portion and the other nutritional indicators, and (2) building a heat map of Starbucks drinks to show the level of each nutritional indicator in each drink.

2.0 Installing and Loading the Required Packages

We will be using the following packages:

packages <- c('seriation', 'dendextend', 'heatmaply', 'corrplot', 'tidyverse', 'kableExtra')

# Install any package that is missing, then load it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

3.0 Loading the dataset

As mentioned earlier, we will be using the “starbucks_drink.csv” dataset to perform our task.

sb <- read_csv("data/starbucks_drink.csv")
kable(head(sb))
| Category | Name | Portion(fl oz) | Calories | Calories from fat | Total Fat(g) | Saturated fat(g) | Trans fat(g) | Cholesterol(mg) | Sodium(mg) | Total Carbohydrate(g) | Dietary Fiber(g) | Sugars(g) | Protein(g) | Caffeine(mg) | Size | Milk | Whipped Cream |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iced-coffee | Cold Brew with Cascara Cold Foam | 12 | 50 | 0 | 0 | 0 | 0 | 0 | 25 | 11 | 0 | 11 | 1 | 145 | Tall | NA | NA |
| iced-coffee | Cold Brew with Cascara Cold Foam | 16 | 80 | 0 | 0 | 0 | 0 | 0 | 30 | 17 | 0 | 17 | 2 | 190 | Grande | NA | NA |
| iced-coffee | Cold Brew with Cascara Cold Foam | 24 | 100 | 0 | 0 | 0 | 0 | 0 | 40 | 22 | 0 | 22 | 2 | 280 | Venti Iced | NA | NA |
| iced-coffee | Cold Brew with Cascara Cold Foam | 30 | 130 | 0 | 0 | 0 | 0 | 0 | 45 | 28 | 0 | 28 | 2 | 320 | Trenta Iced | NA | NA |
| iced-coffee | Iced Coffee | 30 | 160 | 0 | 0 | 0 | 0 | 0 | 15 | 40 | 0 | 39 | 1 | 280 | Trenta Iced | NA | Sweetened |
| iced-coffee | Iced Coffee | 30 | 5 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 1 | 330 | Trenta Iced | NA | Unsweetened |

4.0 Data Wrangling

For the purpose of this task, we are focused solely on kids’ drinks, so we will exclude all other rows from our dataset.

kids_sb <- sb %>% filter(Category == 'kids-drinks-and-other')
kable(head(kids_sb))
| Category | Name | Portion(fl oz) | Calories | Calories from fat | Total Fat(g) | Saturated fat(g) | Trans fat(g) | Cholesterol(mg) | Sodium(mg) | Total Carbohydrate(g) | Dietary Fiber(g) | Sugars(g) | Protein(g) | Caffeine(mg) | Size | Milk | Whipped Cream |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 140 | 45 | 5 | 0 | 0 | 0 | 120 | 25 | 1 | 22 | 2 | 0 | Tall | Almond | No Whipped Cream |
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 210 | 100 | 11 | 4 | 0 | 20 | 125 | 27 | 1 | 25 | 2 | 0 | Tall | Almond | Whipped Cream |
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 170 | 50 | 6 | 5 | 0 | 0 | 130 | 28 | 0 | 26 | 1 | 0 | Tall | Coconut | No Whipped Cream |
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 230 | 110 | 12 | 9 | 0 | 20 | 135 | 30 | 1 | 28 | 1 | 0 | Tall | Coconut | Whipped Cream |
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 170 | 0 | 0 | 0 | 0 | 5 | 120 | 32 | 0 | 31 | 10 | 0 | Tall | Nonfat milk | No Whipped Cream |
| kids-drinks-and-other | Cinnamon Dolce Crème | 12 | 230 | 60 | 6 | 4 | 0 | 25 | 125 | 34 | 0 | 33 | 10 | 0 | Tall | Nonfat milk | Whipped Cream |

On inspection, we notice that Caffeine(mg) has been read in as a character column rather than a numeric one. As such, we will convert it to numeric format before moving on to further data analysis tasks.

kids_sb$`Caffeine(mg)` <- parse_number(kids_sb$`Caffeine(mg)`)

Now we can see that Caffeine(mg) has been classified in the correct format.
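As a minimal sketch of what parse_number() does in this step, the example below runs it on a small hypothetical character vector (the values are made up for illustration and are not taken from the dataset). It shows that text around a number is dropped and that entries with no number at all become NA, which is worth checking for after the conversion.

```r
library(readr)

# Hypothetical character values standing in for the Caffeine(mg) column
caffeine_chr <- c("0", "145", "25 mg", "varies")

# parse_number() keeps the first number it finds in each string;
# strings without a number become NA (readr warns about them, so the
# expected warning is suppressed here)
caffeine_num <- suppressWarnings(parse_number(caffeine_chr))

# Confirm the type and count the entries that could not be parsed
is.numeric(caffeine_num)
sum(is.na(caffeine_num))
```

Running the same two checks on kids_sb$`Caffeine(mg)` would confirm the conversion and reveal whether any rows were turned into NA.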

With this, we can finally move on to our data visualisation tasks.

5.0 Task 1: Correlogram of Nutritional Values

As mentioned earlier, we want to determine which nutritional indicators are strongly correlated with each other, and in particular whether Portion (fl oz) is highly correlated with any of the other nutritional indicators.

5.1 Further Data Wrangling

Before we begin on the task, we need to select only the nutritional indicators (columns 3 to 15) and compute their correlation matrix.

kids_sb.cor <- cor(kids_sb[, 3:15])
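The positional selection above relies on the indicators sitting in columns 3 to 15. As a hedged alternative, the sketch below selects numeric columns by type and then reads off the Portion row of the correlation matrix. The toy tibble and its values are hypothetical; only the column names mimic the dataset.

```r
library(dplyr)

# Hypothetical toy data; column names mimic the Starbucks dataset
toy <- tibble(
  Name = c("A", "B", "C", "D", "E"),
  `Portion(fl oz)` = c(8, 12, 16, 20, 24),
  Calories = c(90, 140, 180, 230, 270),
  `Sugars(g)` = c(10, 14, 21, 24, 31),
  Size = c("Short", "Tall", "Grande", "Venti", "Venti")
)

# Keep only numeric columns, selected by type rather than by position
toy_cor <- toy %>%
  select(where(is.numeric)) %>%
  cor()

# Correlation of every indicator with portion size, strongest first
sort(toy_cor["Portion(fl oz)", ], decreasing = TRUE)
```

Selecting by type only works once Caffeine(mg) has been converted to numeric, otherwise that column would be silently left out.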

5.2 Building the Correlogram

After computing the correlation matrix, we will be using corrplot.mixed() from the corrplot package to plot the correlogram.

corrplot.mixed(kids_sb.cor, 
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               order="AOE",
               tl.col = "black")

Because the labels and correlation values are too large for the plot, we will shrink them using tl.cex and number.cex.

corrplot.mixed(kids_sb.cor, 
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               order="AOE",
               tl.col = "black",
               tl.cex = 0.6,
               number.cex = 0.6)

5.3 Analysis of Correlogram

Here are some interesting stats we observed from the correlogram:

Based on these observations, we will standardise the dataset by portion size: each nutritional indicator is divided by Portion (fl oz), so that drinks are compared per fluid ounce rather than per serving.

6.0 Task 2: Heatmap of Nutritional Indicators

As mentioned in section 1.1 we will be building a heat map to show the relationship between kids’ drinks and nutritional indicators.

6.1 Data Wrangling

We first build a descriptive drink name by combining the columns Name, Milk and Whipped Cream using the paste() function.

kids_sb$DrinkName = paste(kids_sb$Name,kids_sb$Milk, kids_sb$`Whipped Cream`)

We then collapse the dataset to one row per drink using group_by() and summarise(), dividing each indicator's total across sizes by the total Portion (fl oz) to obtain per-fluid-ounce values.

kids_sb2 <- kids_sb %>%
  group_by(`DrinkName`) %>%
  summarise('Calories' = sum(`Calories`)/sum(`Portion(fl oz)`),
           'Calories from fat'  = sum(`Calories from fat`)/sum(`Portion(fl oz)`),
           'Total Fat(g)' = sum(`Total Fat(g)`)/sum(`Portion(fl oz)`),
           'Saturated fat(g)' = sum(`Saturated fat(g)`)/sum(`Portion(fl oz)`),
           'Trans fat(g)' = sum(`Trans fat(g)`)/sum(`Portion(fl oz)`),
           'Cholesterol(mg)' = sum(`Cholesterol(mg)`)/sum(`Portion(fl oz)`),
           'Sodium(mg)' = sum(`Sodium(mg)`)/sum(`Portion(fl oz)`),
           'Total Carbohydrate(g)' = sum(`Total Carbohydrate(g)`)/sum(`Portion(fl oz)`),
           'Dietary Fiber(g)' = sum(`Dietary Fiber(g)`)/sum(`Portion(fl oz)`),
           'Sugars(g)' = sum(`Sugars(g)`)/sum(`Portion(fl oz)`),
           'Protein(g)' = sum(`Protein(g)`)/sum(`Portion(fl oz)`),
           'Caffeine(mg)' = sum(`Caffeine(mg)`)/sum(`Portion(fl oz)`)) %>%
  ungroup()
kable(head(kids_sb2))
| DrinkName | Calories | Calories from fat | Total Fat(g) | Saturated fat(g) | Trans fat(g) | Cholesterol(mg) | Sodium(mg) | Total Carbohydrate(g) | Dietary Fiber(g) | Sugars(g) | Protein(g) | Caffeine(mg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cinnamon Dolce Crème 2% Milk No Whipped Cream | 17.70833 | 4.166667 | 0.4583333 | 0.2916667 | 0.0000000 | 1.979167 | 10.93750 | 2.645833 | 0.0000000 | 2.541667 | 0.7708333 | 0 |
| Cinnamon Dolce Crème 2% Milk Whipped Cream | 22.08333 | 7.916667 | 0.8750000 | 0.5416667 | 0.0208333 | 3.125000 | 11.45833 | 2.770833 | 0.0000000 | 2.708333 | 0.8125000 | 0 |
| Cinnamon Dolce Crème Almond No Whipped Cream | 11.87500 | 3.645833 | 0.3958333 | 0.0208333 | 0.0000000 | 0.000000 | 10.00000 | 2.083333 | 0.1041667 | 1.854167 | 0.1458333 | 0 |
| Cinnamon Dolce Crème Almond Whipped Cream | 16.25000 | 7.291667 | 0.8125000 | 0.2916667 | 0.0000000 | 1.250000 | 10.31250 | 2.229167 | 0.1041667 | 2.020833 | 0.1875000 | 0 |
| Cinnamon Dolce Crème Coconut No Whipped Cream | 13.95833 | 4.375000 | 0.5000000 | 0.4375000 | 0.0000000 | 0.000000 | 10.62500 | 2.333333 | 0.0416667 | 2.145833 | 0.0625000 | 0 |
| Cinnamon Dolce Crème Coconut Whipped Cream | 18.54167 | 8.125000 | 0.9166667 | 0.6875000 | 0.0000000 | 1.250000 | 10.93750 | 2.479167 | 0.0625000 | 2.312500 | 0.1041667 | 0 |

The table above shows only the first six rows; the collapsed dataset contains a total of 60 unique drinks.
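The per-fluid-ounce collapse repeats the same expression for every indicator. As a more compact sketch of the same idea, dplyr's across() can apply one formula to several columns at once. The toy tibble below is hypothetical and uses only two indicators; it assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: drink A comes in two sizes, drink B in one
toy <- tibble(
  DrinkName = c("A", "A", "B"),
  `Portion(fl oz)` = c(8, 12, 16),
  Calories = c(80, 120, 320),
  `Sugars(g)` = c(8, 12, 16)
)

# For each drink, divide the indicator total by the portion total,
# which gives the indicator per fluid ounce
per_oz <- toy %>%
  group_by(DrinkName) %>%
  summarise(
    across(c(Calories, `Sugars(g)`), ~ sum(.x) / sum(`Portion(fl oz)`)),
    .groups = "drop"
  )

per_oz
```

The formula inside across() is the same sum-over-sum ratio used in the full summarise() call, so the two approaches should agree column by column.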

We then need to set the drink names as the row names before transforming the new dataset into a data matrix, so that we can build a heat map.

row.names(kids_sb2) <- kids_sb2$DrinkName
kids_sb_matrix <- data.matrix(kids_sb2)
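Setting row names directly on a tibble is deprecated and triggers a warning. One alternative, sketched here on hypothetical toy data, is tibble's column_to_rownames(), which moves the name column into the row names and returns a plain data frame.

```r
library(tibble)

# Hypothetical toy data with one name column and two indicators
toy <- tibble(
  DrinkName = c("Drink A", "Drink B"),
  Calories = c(10, 20),
  `Sugars(g)` = c(1.5, 2.5)
)

# Move DrinkName into the row names, then convert to a numeric matrix
toy_matrix <- data.matrix(column_to_rownames(toy, var = "DrinkName"))

toy_matrix
```

Note the design difference: with this approach the name column is no longer part of the matrix, so there would be no first column to drop before plotting.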

6.3 Building the Heat Map

We will be building the heat map using heatmaply(). We will first build a test heat map with the default clusters before identifying the best number of clusters later.

heatmaply(normalize(kids_sb_matrix[, -c(1)]),
          Colv = NA,
          seriate = "none",
          colors = Greens,
          fontsize_row = 4,
          fontsize_col = 5)
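The normalize() call matters because the indicators are on very different scales (sodium in the tens, trans fat near zero). As we understand it, heatmaply's normalize() rescales each column to the 0 to 1 range. The base R sketch below illustrates that min-max idea on hypothetical values; min_max() is our own helper and not part of heatmaply.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1].
# Assumes the column is not constant, otherwise the denominator is zero.
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Hypothetical toy matrix with two indicators on different scales
toy <- cbind(Calories = c(10, 15, 20), `Sodium(mg)` = c(100, 300, 200))

# Apply the rescaling column by column
toy_scaled <- apply(toy, 2, min_max)

toy_scaled
```

After rescaling, every column contributes equally to the colour scale and to the Euclidean distances used for clustering.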

6.4 Identifying the Best Number of Clusters

To improve the heat map, we will identify the best clustering method and the best number of clusters.

To find the best clustering method, we will be utilising dend_expend().

kids_sb_matrix2 <- dist(normalize(kids_sb_matrix[, -c(1)]), method = "euclidean")
dend_expend(kids_sb_matrix2)[[3]]
  dist_methods hclust_methods     optim
1      unknown         ward.D 0.5614832
2      unknown        ward.D2 0.6088735
3      unknown         single 0.6646756
4      unknown       complete 0.6243221
5      unknown        average 0.7387914
6      unknown       mcquitty 0.6958625
7      unknown         median 0.5369151
8      unknown       centroid 0.6061457

The output indicates that the ‘average’ method should be used, since it has the highest optim value (0.739).
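To make the optim column less of a black box: assuming dend_expend() is left at its default scoring function, the score is the cophenetic correlation, that is, the correlation between the original distances and the distances implied by the dendrogram. The base R sketch below computes that score for each linkage method on hypothetical random data, so the numbers will not match the table above.

```r
set.seed(42)

# Hypothetical toy data: 15 observations with 4 variables
toy <- matrix(rnorm(60), nrow = 15)
d <- dist(toy, method = "euclidean")

methods <- c("ward.D", "ward.D2", "single", "complete",
             "average", "mcquitty", "median", "centroid")

# Cophenetic correlation: how well each tree preserves the original distances
scores <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  cor(as.vector(d), as.vector(cophenetic(hc)))
})

# Linkage method with the highest score
best <- names(which.max(scores))

round(scores, 3)
best
```

A higher score means the dendrogram distorts the pairwise distances less, which is the sense in which one linkage method is "best" here.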

And to determine the best number of clusters, we will be using find_k().

kids_sb_cluster <- hclust(kids_sb_matrix2, method = "average")
kids_sb_k <- find_k(kids_sb_cluster)
plot(kids_sb_k)

From the figure above, we see that k = 10 is the optimal number of clusters.
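As we understand it, find_k() chooses k by maximising the average silhouette width. The sketch below illustrates that criterion with a simpler stand-in: it cuts an hclust tree at several values of k and scores each cut with silhouette() from the cluster package. This is not the exact procedure inside find_k(), and the toy data are hypothetical.

```r
library(cluster)  # provides silhouette()

set.seed(1)

# Hypothetical toy data: three well-separated groups of 10 points each
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)

d <- dist(toy, method = "euclidean")
hc <- hclust(d, method = "average")

# Average silhouette width for each candidate number of clusters
k_range <- 2:6
avg_sil <- sapply(k_range, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})

# Candidate k with the highest average silhouette width
best_k <- k_range[which.max(avg_sil)]
best_k
```

When the best k sits at the edge of the range searched, it is worth widening the range to check whether the score keeps improving.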

6.5 Replotting the Heat Map

With the best clustering method and number of clusters identified, we will replot the heat map while adding more details such as the title and axis labels.

heatmaply(normalize(kids_sb_matrix[,-c(1)]),
          dist_method = "euclidean",
          hclust_method = "average",
          seriate = "none",
          show_dendrogram = c(TRUE, FALSE),
          k_row = 10,
          colors = Greens,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          xlab = "Nutritional Indicators",
          ylab = "Drink Types",
          main="Starbucks Kids' Drinks nutrition \nindicator by Drink Types",
          Colv = NA
          )

7.0 Heat Map Findings

From the heat map, we can see that drinks containing Salted Caramel have the highest calories, together with high amounts of sodium, cholesterol and carbohydrates. This suggests that salted caramel is among the least healthy options in this category, and kids should avoid it if they can.

We can also see that Hot Chocolate drinks have a high amount of calories, total fat, cholesterol and sugars. This is further exacerbated when they are ordered in combination with Whipped Cream and Salted Caramel.

Whipped Cream also contributes to a high amount of calories due to a higher amount of total fat. As such, kids should try their best to avoid ordering whipped cream to reduce their calorie count.

Upon further observation, drinks that contain any form of Milk have a higher amount of calories due to a higher amount of saturated fat, which leads to a higher amount of total fat. Kids should be aware of this and try to avoid adding milk to their drinks.

Interestingly, Hot Chocolate and Pumpkin Spice drinks have the highest amount of caffeine compared to the other drinks. As such, kids should avoid these drinks where possible, since the caffeine may leave them restless for the rest of the day.

As for the healthier drinks, consumers should go for Crème drinks, which have a lower calorie count compared to the other drinks. Ordering them with no whipped cream and no milk gives the lowest amount of calories.